Parsing XML Documents with Xbasic

Description

Alpha Anywhere has a built-in XML parser that can be used as an alternative to the Microsoft XML parser. The advantage of the Xbasic XML parser is that it works with all of the Xbasic string functions. The Microsoft XML parsers are more complex to use because they require OLE Automation.

Parsing XML Documents

With the Alpha Anywhere XML parser you can:

  • Extract information from XML data

  • Transform XML data (much like an XSLT)

  • Add elements and attributes to XML data

  • Remove elements and attributes from XML data

  • Change attribute values in XML data

  • Reorder elements in XML data

Using the Xbasic XML Parser - A Tutorial

Assume that you have an XML file with the following data in it:

<employee>
    <name city="Boston">
        <firstname>Frank</firstname>
        <lastname>Smith</lastname>
    </name>
    <name city="Ithaca">
        <firstname>Milton</firstname>
        <lastname>Jones</lastname>
    </name>
</employee>

The following sample Interactive window session shows some of the features of the XML parser:

'Get a new instance of the XML parser
sm = xmlSchemaManager.Get()
'Load the XML file into a variable
xml = get_from_file("c:\data\testxml.xml")
'Load the XML data into the XML parser.
'The .LoadXML() method or the .LoadUnBalancedXML() method can be used.
'If you want to parse HTML (where unbalanced tags are allowed), then the .LoadUnBalancedXML() method
'should be used
dom = sm.LoadXML(xml)

TIP: Xbasic also provides a simple high-level function, *XML_Document(), that returns a parsed XML document in a single step (without having to first instantiate the xmlSchemaManager object). The following single Xbasic command is equivalent to the commands above that parse the XML document:

dom = *XML_Document(xml)

Once you have loaded the XML data into the schema manager (i.e. you have parsed the XML - it is loaded into a 'parse tree'), you can start examining the properties and working with the methods of the schema manager. For example:

'The .top property references the outermost element.

'All elements have an .OuterXML property (among many other properties)

'Notice that the bubble help in the Interactive window will show you all of the properties.

 

?dom.top.OuterXML

= <employee>
     <name city="Boston">
         <firstname>Frank</firstname>
         <lastname>Smith</lastname>
     </name>
     <name city="Ithaca">
         <firstname>Milton</firstname>
         <lastname>Jones</lastname>
     </name>
    </employee>

When the XML is parsed, an array of all of the elements (the 'all' array) is automatically created.

'The XML data has 7 elements

?dom.all.size()

= 7

 

'The .Output() method can be used to dump information from the XML parse tree.

'The .Output() method can use specially named symbols in the output expression. 

'For example, *elementId is the index into the 'all' array (seen above), and

'*tag is the name of the element. The output expression is a standard Xbasic expression.

 

?dom.Output("*elementId +' ' +*tag +crlf()")
 = 1 employee
 2 name
 3 firstname
 4 lastname
 5 name
 6 firstname
 7 lastname

 

 

 

'Note that in this output expression, 'city' is an attribute value (we can tell because it does not start with *)

'Any 'fields' in the output expression that don't start with * are attribute names.

'Notice that only elements 2 and 5 have values for the 'city' attribute because the 'city'

'attribute is only defined for the 'name' element.

?dom.Output("*elementId +' ' +*tag + ' - ' + city + crlf()")
 = 1 employee - 
 2 name - Boston
 3 firstname - 
 4 lastname - 
 5 name - Ithaca
 6 firstname - 
 7 lastname - 
  

'The 'all' array contains all of the XML elements

'The outerXML property is the entire XML string for that element

?dom.all[5].outerxml
 = <name city="Ithaca">
 <firstname>Milton</firstname>
 <lastname>Jones</lastname>
 </name>
  

 

'The innerXML property is the XML that is contained by that element.
 ?dom.all[5].innerxml
 = 
 <firstname>Milton</firstname>
 <lastname>Jones</lastname>

Once you have parsed the XML, it is easy to modify it. For example, here is how we can delete elements in the XML data. This section shows how nodes in the XML tree can be 'marked' and then deleted.

'Make sure all elements are initially unmarked

'If .T. is returned, then there were some marked elements.

'If .F. is returned, then there were no marked elements.

'In this case since we have not previously marked any elements, .F. is returned.

?dom.UnmarkAllElements()
 = .F.
  

'Now mark all elements that have a 'city' attribute equal to 'Boston'

'Note that the query expression is a standard Xbasic filter expression

'No need for cryptic XPath syntax!!

'The second argument is set to .T.. This causes the child elements of each found

'element to also be marked.

'The method returns .T. in this case, indicating that at least one match was found.

?dom.MarkElements("city='boston'",.t.)
 = .T.

 

'Now delete the marked elements
 dom.DeleteMarked()

  

'Examine the resulting XML

'It looks as expected, but it contains a blank line where the deleted elements were.
 ?dom.top.OuterXML
 = <employee>

     <name city="Ithaca">
         <firstname>Milton</firstname>
         <lastname>Jones</lastname>
     </name>
 </employee>

  

'Call the .reformat() method and examine the XML again
 dom.Reformat()
 ?dom.top.OuterXML
 = <employee>
     <name city="Ithaca">
         <firstname>Milton</firstname>
         <lastname>Jones</lastname>
     </name>
  </employee>

Xbasic has several ways in which you can navigate the XML DOM after it has been parsed. First, let's find out how many nodes exist at the top level of the XML tree.

?dom.top.children.size()
 = 2

There are two nodes. Let's get a pointer to the first node.

c1 = dom.top.children[1]

This object has several properties, one of which is 'OuterXML'. This property can be both read and set. Let's first examine it:

?c1.OuterXML
 = <name city="Boston">
 <firstname>Frank</firstname>
 <lastname>Smith</lastname>
 </name>

Now, let's set it to a new value:

c1.OuterXML = "<name city=\"Atlanta\"></name>"

And now let's examine the entire XML tree to see our change:

?dom.top.OuterXML
 = <employee>
 <name city="Atlanta"/>
 <name city="Ithaca">
 <firstname>Milton</firstname>
 <lastname>Jones</lastname>
 </name>
 </employee>

Note that the 'InnerXML' property is similar to the 'OuterXML' property, but it does not include the enclosing tags:

?dom.top.children[2].innerXML
 = 
 <firstname>Milton</firstname>
 <lastname>Jones</lastname>

Each element in the XML tree can have an arbitrary number of attributes. You can read and set these attributes, and you can create new attributes. In our example, 'city' is an attribute of the 'name' element. To find out how many attributes a particular element has, get a pointer to the element and then call the .size() method of its 'attribute' array. For example, let's examine the second element in our XML tree. First, get a pointer to the element:

c2 = dom.top.children[2]

See how many attributes this element has:

?c2.attribute.size()
 = 1

Check to see if a particular attribute of this element exists:

?c2.AttributeExists("city")
 = .T.

Now, get the attribute's value:

?c2.AttributeGet("city")
 = "Ithaca"

Now, set the attribute to a new value:

c2.AttributeSet("city","Binghamton")

Inspect the element's 'OuterXML' to confirm that the change was made:

?c2.OuterXML
 = <name city="Binghamton">
 <firstname>Milton</firstname>
 <lastname>Jones</lastname>
 </name>

Now create a new attribute for the element:

c2.AttributeSet("population","123000")

And again, inspect the element's 'OuterXML':

?c2.OuterXML
 = <name city="Binghamton" population="123000">
 <firstname>Milton</firstname>
 <lastname>Jones</lastname>
 </name>

If we check the size of the 'attribute' array, we see that it now has two entries:

?c2.attribute.size()
 = 2

We can read the name and the value of any entry in the 'attribute' array:

?c2.attribute[1].name 
 = "city"
 ?c2.attribute[1].value 
 = "Binghamton"

Attributes can be dropped:

c2.AttributeDrop("population")
 ?c2.OuterXML
 = <name city="Binghamton">
 <firstname>Milton</firstname>
 <lastname>Jones</lastname>
 </name>
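Because 'attribute' is an array with a .size() method, and each entry has .name and .value properties, all of the attributes of an element can be listed with a loop. The following is a minimal sketch; the function name list_attributes is hypothetical and is not part of Xbasic.

```xbasic
'Hypothetical helper: builds a CR-LF delimited list of name=value pairs
'for every attribute of the element that is passed in.
function list_attributes as C (element as P)
    dim txt as C = ""
    dim i as N
    for i = 1 to element.attribute.size()
        txt = txt + element.attribute[i].name + "=" + element.attribute[i].value + crlf()
    next i
    list_attributes = txt
end function
```

For example, ?list_attributes(dom.top.children[2]) in the Interactive window would list the attributes of the second 'name' element.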

Running Queries against the XML Data

  • Element Queries

    The Xbasic XML parser lets you run queries against your XML data using familiar Xbasic query syntax. There is no need to learn complicated XPath syntax, which is normally used to query XML files. In the following example, we run element queries. You can also run attribute queries, which we show later. (The examples in this section assume that the original, unmodified XML data has been loaded into 'dom' again.) The .QueryElement() method is used to run element queries. This method creates an 'element array' - an array of all elements that match the query expression. (You can think of an element query as returning a 'pruned' version of the DOM - without a top node.) In this example, we find all elements that have a tag name of 'firstname':

    q2 = dom.QueryElement("*tag = 'firstname'")
  • We can find out how many elements were found by calling the .size() method of the 'all' array:

    ?q2.all.size()
     = 2
  • We can dump the values of the elements:

    ?q2.Output("*value+crlf()")
     = Frank
     Milton
  • Now, let's do a more complex search. Notice that we are using familiar Xbasic filter syntax:

    q3 = dom.QueryElement("*tag = 'name' .and. city = 'boston'")
     ?q3.all.size()
     = 1
    
     ?q3.all[1].outerXML
     = <name city="Boston">
     <firstname>Frank</firstname>
     <lastname>Smith</lastname>
     </name>
  • Note that the query object is still linked to the XML parser. Any changes made to the XML when working with the results of a query will be reflected in the full XML tree (i.e. the XML shown by dom.top.OuterXML in these examples).

  • Attribute Queries

    Attribute queries are typically less useful. An attribute query returns an 'attribute array' - an array of all attributes that match the 'filter' expression. (In the case of an attribute query, you do not enter a logical filter expression. Instead, you specify a CR-LF delimited list of attribute names). Let's add a new attribute to our XML and then do an attribute query:

    c1 = dom.top.children[1]
    c1.AttributeSet("state","MA")
    qa1 = dom.queryAttributes("state")
    ?qa1.all.size()
    =1
    ?qa1.all[1].value
    = "MA"
  • Now, let's search for multiple attributes:

    qa1 = dom.queryAttributes("state" + crlf() + "city")
    ?qa1.all.size()
    =3
    ?qa1.all[1].value
    = "MA"
    ?qa1.all[1].name 
    = "state"
  • The .DumpFormat(), .GetValues() and .SetValues() methods can be used with the result of an attribute query. Here we dump out the element name, attribute name and value of each item in the array returned by the attribute query:

    ?qa1.DumpFormat("E.N=V" + crlf() )
     = name.state=MA
     name.city=Boston
     name.city=Ithaca
  • Here we get a CR-LF delimited list of attribute values:

    ?qa1.GetValues()
     = MA
     Boston
     Ithaca
  • Now, let's do some transformation on these values:

    list = qa1.GetValues()
     list = alltrim( upper(list) )
     qa1.SetValues(list)
    
     ?dom.top.OuterXML
     = <employee>
     <name city="BOSTON" state="MA">
     <firstname>Frank</firstname>
     <lastname>Smith</lastname>
     </name>
     <name city="ITHACA">
     <firstname>Milton</firstname>
     <lastname>Jones</lastname>
     </name>
     </employee>
  • Important: If your XML tree does not have a top element, the XML parser will automatically insert one.
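As noted in the element query discussion above, a query result stays linked to the parser, so a change made through the query result also appears in the full tree. The following Interactive window sketch illustrates this. It assumes that the elements in a query's 'all' array expose the same .AttributeSet() method as the elements reached through dom.top.children; the 'state' value used here is purely illustrative.

```xbasic
'Find the 'name' element for Ithaca
q4 = dom.QueryElement("*tag = 'name' .and. city = 'ithaca'")
'Confirm that a match was found before using all[1]
?q4.all.size()
'Change the element through the query result (assumed to be supported)
q4.all[1].AttributeSet("state","NY")
'Examine the full tree; the new attribute should appear here as well
?dom.top.OuterXML
```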

XML Parser Expression Reserved Words

The following list shows the reserved words that can be used wherever expressions are allowed (i.e. with the .Output(), .QueryElement(), .MarkElements(), .UnmarkElements(), .Resolve(), and .FindElement() methods):

  • *elementId

    Index into the all[] array. This is the order in which the elements appear in the XML data.

  • *value

    Contents of the XML element as raw text

  • *xml

    Inner XML of an element

  • *istop

    Returns .T. if the element is the topmost element

  • *tagnumber

    Child number of the element within the current parent element

  • *tagcount

    Number of siblings for this element

  • *isleaf

    Returns .T. if an element has no children

  • *depth

    How many nodes deep the current element is (if *istop is .T., then *depth is 1)

  • *marked

    Returns .T. if the current element has been marked. Elements are marked by setting the element's .Marked property, or by calling the .MarkElements() method.

  • *fullname

    The fully qualified tag name. A dot separated list of this tag name and all of the parents. Assume you have an <employees> tag with a child <name> tag, with a child <firstname> tag. The *fullname of the 'firstname' tag is 'employees.name.firstname'.

  • *tag

    The current element name. The *tag reserved word can be used with the following 'navigation' directives. Navigation directives are delimited with periods:

      • *parent - the parent tag

      • *prev - the previous sibling

      • *next - the next sibling

      • *first - the first sibling on the current branch

      • *last - the last sibling on the current branch.

    • Navigation directives can be nested to an arbitrary depth, with each directive separated from the next by a period.

    • This syntax can also be used to get attribute values. For example, the following syntax will get the value of the 'city' attribute from the current element's parent element.

      • *tag.*parent.city

The following example shows how the navigation directives can be used to create complex output:

q1 = dom.QueryElement("*tag = 'firstname'")
 ?q1.Output("*value + ' ' + *tag.*next.*value + ' from ' + *tag.*parent.city + crlf()")
 = Frank Smith from BOSTON
 Milton Jones from ITHACA
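The other reserved words can be combined in the same way. The following Interactive window sketch uses only reserved words from the list above, but the exact expressions are assumptions rather than tested output: in particular, it assumes that *isleaf can be compared with .t. in a query expression and that *tag.*parent resolves to the parent's tag name.

```xbasic
'List every element by its fully qualified, dot separated name
?dom.Output("*fullname + crlf()")
'Select only the elements that have no children
q5 = dom.QueryElement("*isleaf = .t.")
'Show each leaf value together with the tag name of its parent
?q5.Output("*tag.*parent + ': ' + *value + crlf()")
```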

Marking Elements in the XML Tree

Marking elements is useful because it allows you to move and delete elements from the XML tree. You can use an Xbasic filter expression to select the elements that you want to mark. Having marked elements, you can then use methods to move and delete the marked elements. For example:

?dom.top.OuterXML
 = <employee>
     <name city="BOSTON" state="MA">
         <firstname>Frank</firstname>
         <lastname>Smith</lastname>
     </name>
     <name city="ITHACA">
         <firstname>Milton</firstname>
         <lastname>Jones</lastname>
     </name>
 </employee>

Now, let's mark the element that has a city attribute equal to 'Boston':

?dom.MarkElements("city='boston'")
 = .T.

Note that .T. indicates that at least one element was selected and marked. The following shows that the first child of the top element was marked, and the second was not:

?dom.top.children[1].marked 
 = .T.
 ?dom.top.children[2].marked 
 = .F.

Get a pointer to the second child of the top element:

c2 = dom.top.children[2]

Now move all of the marked elements after this element:

c2.MoveMarkedAfter()

And here is how the XML tree has been transformed:

?dom.top.OuterXML
 = <employee>

<name city="ITHACA">
     <firstname>Milton</firstname>
     <lastname>Jones</lastname>
 </name>
 <name city="BOSTON" state="MA">
     <firstname>Frank</firstname>
     <lastname>Smith</lastname>
 </name></employee>

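The following methods, all used in the examples above, are useful for working with marked elements: .MarkElements(), .UnmarkAllElements(), .DeleteMarked() and .MoveMarkedAfter().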

XML Helper Functions

A common requirement when working with XML data is to quickly extract some attribute values from an XML element. Xbasic provides a function to do this. The *XML_PEEK_ATTRIBUTE() function extracts attribute values from the top-level element of an XML string. The following examples demonstrate the function.

xml = "<data city='Boston' firstname='Fred' lastname='Smith'/>"

 ?*XML_PEEK_ATTRIBUTE(xml,"city")
 = "Boston"

?*XML_PEEK_ATTRIBUTE(xml,"firstname")
 = "Fred"
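Because *XML_PEEK_ATTRIBUTE() returns a plain character value, its results can be combined with ordinary Xbasic string expressions without parsing the whole document. A minimal sketch, using the same sample data as above (the variable name fullname is illustrative only):

```xbasic
xml = "<data city='Boston' firstname='Fred' lastname='Smith'/>"
'Build a display name from two attributes of the top-level element
fullname = *XML_PEEK_ATTRIBUTE(xml,"firstname") + " " + *XML_PEEK_ATTRIBUTE(xml,"lastname")
?fullname
```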